Finding Social Science Data for Your Research

Josh Quan

UC Berkeley Library

Fall 2017

An approximate answer to the right question is worth a great deal more than a precise answer to the wrong question.

-John Tukey

Plan your Research with a Literature Review

http://www.lib.berkeley.edu/

Ask Yourself…

  • How feasible is your research question?

  • How many observations do you need?

  • Does the answer to your question have too many angles? If so, then your question might be too broad to answer on time.

Structure and availability of data

  • Unit of Analysis: For which level do you want data? Summary or micro? (individuals, counties, nations)

  • Geography: Is there a geographic component to your topic? (U.S., Sub-Saharan Africa, India)

  • Time Period: Do you want data for a specific time period? (1980-2000, 1930-1960)

  • Frequency: How often do you want measures for your variables? (every year, every ten years, monthly, quarterly)

Data Providers

  • Researchers: Are there people you know who are doing this kind of research?

  • Government Agencies: Is the request for official statistics or data that an agency would be likely to collect and publish? (industry, agriculture, construction, disease, crime)

  • NGOs: Are there councils or interest organizations devoted to the topic that might collect data independently? (HIV/AIDS, drugs, civil rights)

  • Research Organizations: Would any specific research organizations be interested in the topic? (Pew, Roper, Gallup, NORC, NBER, World Bank, OECD)

Mind the 80/20 Rule

It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. -Dasu and Johnson, 2003

Tidy Data = Happy Data

“Happy families are all alike; every unhappy family is unhappy in its own way.” -Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” -Hadley Wickham

Tidy data has the following attributes:

  • Each variable forms a column that holds its values.

  • Each observation forms a row.

  • Each type of observational unit forms a table.
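
The attributes above can be made concrete with a small sketch in base R. It assumes a "wide" table in which each quarter is its own column, then reshapes it so that each row is one observation (one year-quarter pair). The object names `wide` and `long` and the column names `quarter` and `index` are chosen here for illustration; the four values are taken from the 2014 and 2015 rows of the price index table shown later in these slides.

```r
# Wide layout: the variable "quarter" is spread across the column names I and II
wide <- data.frame(
  year = c(2014, 2015),
  I    = c(107.3, 108.6),
  II   = c(107.2, 108.8)
)

# Tidy (long) layout: one column per variable, one row per observation
long <- reshape(
  wide,
  direction = "long",
  varying   = c("I", "II"),   # columns that hold the measured values
  v.names   = "index",        # name of the new value column
  timevar   = "quarter",      # name of the new variable column
  times     = c("I", "II"),   # values stored in the new variable column
  idvar     = "year"
)

long
```

The result has four rows and the columns `year`, `quarter`, and `index`. Packages such as tidyr offer more readable helpers for the same reshaping; base R is used here only to keep the sketch free of dependencies.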

Variable Naming

Good Example       Bad Example              Description
gnp2010            gnp-2002; gnp#2002       Avoid special characters
real_int           real interest rate       Use underscores instead of spaces
score1; gnp2003    1st_score; 2003gnp       Begin with a letter
reg_out; glm1      REG; glm; ttest          Avoid reserved words
invest; interest   xxx; yyy; zmdje          Use meaningful names
male; asian        gender; race             Name a dummy variable after the value coded 1
citizen            Are_you_a_US_citizen?    The shorter, the better
income; intUS03    INCOME; Int_us2003       Use lowercase
2017-04-20         April 20, 2017           Use the ISO 8601 date format (YYYY-MM-DD)
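
Several of these rules can be checked mechanically. The sketch below is one strict reading of them, not a standard: the helper `is_good_name` is hypothetical, and its pattern accepts only names that begin with a lowercase letter and continue with lowercase letters, digits, or underscores. It does not test for reserved words or for meaningfulness, which still need a human eye.

```r
# Hypothetical helper: TRUE when a name starts with a lowercase letter and
# contains only lowercase letters, digits, and underscores
is_good_name <- function(x) grepl("^[a-z][a-z0-9_]*$", x)

# Candidate names taken from the table above
candidates <- c("gnp2010", "gnp-2002", "real_int",
                "real interest rate", "1st_score", "INCOME")

checked <- data.frame(name = candidates, ok = is_good_name(candidates))
checked
```

Here `gnp2010` and `real_int` pass, while the other four fail because of a special character, spaces, a leading digit, and uppercase letters respectively.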

Missing values, Zeros, and Nulls
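
In R these three ideas behave differently, and a short sketch may help keep them apart. The `income` vector below is invented for illustration: `NA` stands for an answer that was not recorded, while `0` is a real answer of zero.

```r
# Hypothetical survey responses: NA is a missing answer, 0 is a genuine zero
income <- c(1200, 0, NA, 800)

mean_with_na  <- mean(income)                # NA: a missing value propagates
mean_observed <- mean(income, na.rm = TRUE)  # mean of the three observed values

n_missing <- sum(is.na(income))              # count of missing answers
n_zero    <- sum(income == 0, na.rm = TRUE)  # count of genuine zeros

# NULL is the absence of an object, not a missing value inside one
empty    <- NULL
combined <- c(1, NULL, 3)                    # NULL is dropped when combining
```

Recoding missing answers as zero would pull `mean_observed` down, which is why the distinction matters before any analysis starts.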

Web Scraping

http://statbel.fgov.be/en/statistics/figures/economy/indicators/prix_prod_con/

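library(rvest)  # provides read_html(), html_nodes(), html_text() and the %>% pipe

url <- 'http://statbel.fgov.be/en/statistics/figures/economy/indicators/prix_prod_con/'
page <- read_html(url)  # download and parse the page once
TAB <- page %>% html_nodes('td') %>% html_text()    # text of every table cell
NAMES <- page %>% html_nodes('th') %>% html_text()  # column headers, then the year labels
M <- data.frame(matrix(TAB, ncol = 5, nrow = 9, byrow = TRUE))  # 9 years x 5 values
M <- cbind(NAMES[7:15], M)  # prepend the year labels as the first column
names(M) <- NAMES[1:6]      # the first six headers become the column names
M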
##   Gross indices (2010=100)     I    II   III     IV  Year
## 1                     2008  99.9 101.2 101.0  102.3 101.1
## 2                     2009 101.0  99.7 100.5   98.9 100.0
## 3                     2010  99.4  99.8 100.0  100.8 100.0
## 4                     2011 102.9 103.2 104.5  105.1 103.9
## 5                     2012 105.7 106.1 106.0  105.6 105.9
## 6                     2013 105.4 105.4 106.7  107.1 106.1
## 7                     2014 107.3 107.2 107.4  107.6 107.4
## 8                     2015 108.6 108.8 109.3  109.5 109.1
## 9                     2016 110.3 110.7 110,8  111,3 110.8
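
Scraped values arrive as text, and the 2016 row above shows why that matters: `110,8` and `111,3` use a decimal comma while every other cell uses a point. A minimal sketch of one way to normalize such cells before converting them to numbers follows; the helper name `as_number` is made up for this example, and it assumes commas never appear as thousands separators.

```r
# Two cells copied from the 2016 row above, plus one well-formed neighbor
raw <- c("110.7", "110,8", "111,3")

# Hypothetical helper: swap decimal commas for points, then convert to numeric
as_number <- function(x) as.numeric(gsub(",", ".", x, fixed = TRUE))

clean <- as_number(raw)
clean
```

Calling `as.numeric()` directly on `"110,8"` would return `NA` with a warning, so the substitution has to happen first.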

Text-mining

http://guides.lib.berkeley.edu/text-mining

D-Lab, Library Data Lab, Statistics Department

Questions?